Change point detection for clustered expression data

Background
To detect changes in biological processes, samples are often studied at several time points. We examined expression data measured at different developmental stages, or more broadly, historical data. Hence, the main assumption of our proposed methodology was the independence between the examined samples over time. In addition, however, the examinations were clustered at each time point, because littermates from relatively few mother mice were measured at each developmental stage. As each examination was lethal, we had an independent data structure over the entire history, but a dependent data structure at each particular time point. Over the course of these historical data, we wanted to identify abrupt changes in the parameter of interest, the so-called change points.

Results
In this study, we demonstrated the application of generalized hypothesis testing using a linear mixed effects model as a possible method to detect change points. The coefficients from the linear mixed model were used in multiple contrast tests, and the effect estimates were visualized with their respective simultaneous confidence intervals. The latter were used to determine the change point(s). In small simulation studies, we modelled different courses with abrupt changes and compared the influence of different contrast matrices. We found two contrasts that answer different research questions in change point detection: the Sequen contrast, which detects individual change points, and the McDermott contrast, which finds change points due to overall progression. We provide the R code for direct use together with examples. The applicability of these tests to real experimental data was shown with in-vivo data from a preclinical study.

Conclusion
Simultaneous confidence intervals estimated by multiple contrast tests using the model fit from a linear mixed model were able to determine change points in clustered expression data. The confidence intervals directly delivered interpretable effect estimates representing the strength of the potential change point. Hence, scientists can define a biologically relevant threshold of effect strength depending on their research question. We found two rarely used contrasts best suited for the detection of a possible change point: the Sequen and McDermott contrasts.

Supplementary Information
The online version contains supplementary material available at (10.1186/s12864-022-08680-9).
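To illustrate how the two contrasts named above differ, the following is a minimal sketch in Python rather than the authors' R code. It assumes equal group sizes at every time point, so that all earlier time points are pooled with equal weight, and it assumes the usual reading of the two contrasts: Sequen compares each time point with its direct predecessor, and McDermott compares each time point with the mean of all preceding ones. The function names and the example means are hypothetical.

```python
from fractions import Fraction


def sequen_contrast(k):
    """Successive differences: row i compares time point i+1 with time point i."""
    rows = []
    for i in range(k - 1):
        row = [Fraction(0)] * k
        row[i] = Fraction(-1)
        row[i + 1] = Fraction(1)
        rows.append(row)
    return rows


def mcdermott_contrast(k):
    """Row i compares time point i+1 with the unweighted mean of all earlier
    time points (assumption: equal group sizes)."""
    rows = []
    for i in range(1, k):
        row = [Fraction(0)] * k
        for j in range(i):
            row[j] = Fraction(-1, i)
        row[i] = Fraction(1)
        rows.append(row)
    return rows


def apply_contrasts(matrix, means):
    """Contrast estimates: each row of the matrix times the vector of means."""
    return [sum(c * m for c, m in zip(row, means)) for row in matrix]


# Hypothetical time-point means with one abrupt change between time points 3 and 4.
means = [2, 2, 2, 5, 5]
sequen_estimates = apply_contrasts(sequen_contrast(len(means)), means)
mcdermott_estimates = apply_contrasts(mcdermott_contrast(len(means)), means)
```

In this toy example the Sequen estimates are nonzero only at the step itself, whereas the McDermott estimates stay nonzero after the step, because every later time point is still compared with a pooled mean that includes the earlier, lower values.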

This supplement provides additional material for the paper. We show additional figures for the biological data and for the simulation settings. In addition, we present some R code.
Supplementary Section 2 provides three additional biological data sets.
Supplementary Section 3 provides additional information on the simulation setting. We repeated all settings to give a better overview of all simulations. Therefore, some figures appear both in the paper and in the supplement.
Supplementary Section 4 presents additional R code for our analysis. Please also consider the GitHub repository for direct access to the R code: https://github.com/msieg08/clustered_data_changepoint_detection
Supplementary Section 5 presents a small simulation study on the influence of different litter variances on the course of the confidence intervals.
Supplementary Section 6 shows a flowchart of the methods from the simulation to the final figures.

Additional Information on the Biological Expression Example Data
The data structure was the same for all four data sets. Each data set consisted of gene expression data from multiple samples determined at 12 fixed time points plus the adult stage. The time points represent different developmental stages. The gene expression information was constrained to one gene measured in one organ per time series. Expression of a specific gene in a specific organ was measured in multiple mouse pups from multiple mothers from twelve days after coitus (E12, Theiler Stage TS20) onwards. The 12 fixed time points comprised two embryonic, four fetal and six postnatal stages, followed by the adult stage. No pup was included twice and each mother had only one litter, i.e. at each time point, the litters originated from different mothers.
The variance introduced by differences between litters is called the litter effect. Not including this information in the final model could lead to overdispersion [1]. From a statistical point of view, this means that expression data obtained from the same litter were dependent, whereas data from different litters were independent. Hence, at each time point the data consisted of both dependent and independent data points. Additionally, expression information from different time points was independent.
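The dependence structure described here can be made concrete with a small simulation sketch. It is not the simulation code of the paper: it assumes a simple random-intercept model in which every pup's value is the time-point mean plus a normally distributed litter effect plus normally distributed pup-level noise. All function names and parameter values are hypothetical.

```python
import random


def simulate_time_point(mean, n_litters, pups_per_litter, sd_litter, sd_pup, rng):
    """Simulate one time point: pups of the same litter share one litter effect."""
    data = []
    for litter in range(n_litters):
        litter_effect = rng.gauss(0.0, sd_litter)  # shared by all littermates
        for _ in range(pups_per_litter):
            value = mean + litter_effect + rng.gauss(0.0, sd_pup)
            data.append((litter, value))
    return data


def intraclass_correlation(sd_litter, sd_pup):
    """Correlation between two littermates under the random-intercept model."""
    return sd_litter ** 2 / (sd_litter ** 2 + sd_pup ** 2)


rng = random.Random(1)  # fixed seed so the sketch is reproducible
# Hypothetical setting: 3 litters with 4 pups each at one developmental stage.
observations = simulate_time_point(
    mean=5.0, n_litters=3, pups_per_litter=4, sd_litter=1.0, sd_pup=1.0, rng=rng
)
icc = intraclass_correlation(sd_litter=1.0, sd_pup=1.0)
```

Under these assumptions, the larger the litter variance is relative to the pup-level variance, the more strongly littermates resemble each other, which is why ignoring the litter in the model treats correlated observations as if they were independent.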

Additional Figures
Supplementary Figure 1: Biological example data for Car9 expression in the developing kidney. Subplot a) shows the biological example data set. Each point in time on the x-axis represents a developmental stage. The developmental stages are independent. Each point represents a pup and each color a mother animal. The pups are nested within the mothers. We added three broader development stages

Additional Tables
Supplementary Table 1: Contrasts and estimates of Figure 1. The table shows the numeric values from the kidney Car9 example. The C column indicates the contrast, and the ∆ column the log mean change of the corresponding contrast C. The gray row indicates a possible change point identified by visual inspection of Figure 1. A confidence interval indicates a significant change if it does not include zero.
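The decision rule behind such a table can be sketched as follows. This is a minimal illustration, not the authors' code: it takes simultaneous confidence intervals as given and flags a contrast when its interval excludes zero or, if a relevance threshold is set, when the whole interval lies beyond that threshold. The contrast labels, the interval limits and the threshold are hypothetical values, not results from the paper.

```python
def flag_change_points(intervals, threshold=0.0):
    """Return the contrasts whose confidence interval lies entirely above
    +threshold or entirely below -threshold (threshold=0 means 'excludes zero')."""
    flagged = []
    for name, lower, upper in intervals:
        if lower > threshold or upper < -threshold:
            flagged.append(name)
    return flagged


# Hypothetical simultaneous confidence intervals on the log scale:
# (contrast label, lower limit, upper limit)
intervals = [
    ("t2 - t1", -0.4, 0.3),   # includes zero: no change point indicated
    ("t3 - t2", 0.2, 0.9),    # excludes zero, but close to it
    ("t4 - t3", 1.1, 2.0),    # clearly above zero
    ("t5 - t4", -1.8, -0.6),  # clearly below zero
]

significant = flag_change_points(intervals)
relevant = flag_change_points(intervals, threshold=0.5)
```

With a threshold of zero, three of the four hypothetical contrasts are flagged. Raising the threshold to 0.5 drops the contrast whose interval starts at 0.2, which mirrors the idea of defining a biologically relevant effect strength rather than relying on significance alone.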

Additional Figures
Supplementary Figure 4: Confidence intervals of estimates from the linear mixed model coupled with a contrast matrix for historical data with no change point. Subfigure a) shows the gene expression activity (y-axis) of the sampled historical data over the increasing points in time (x-axis), with no expected change point. Each color is related to one mother mouse. Subfigures b), c) and d) show the estimates (x-axis) including confidence intervals for the observed contrasts (y-axis) with the methods Changepoint, Sequen and McDermott, respectively.

Tables
Supplementary Table 4: Contrasts and estimates for Supplementary Figure 4. The table shows the numeric values from the simulation. The C column indicates the contrast, and the ∆ column the log mean change of the corresponding contrast C. No change point was simulated. A confidence interval indicates a significant change if it does not include zero.
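The Changepoint contrast named in the figure caption can be sketched in the same spirit. This sketch assumes equal group sizes and the common reading of the contrast, in which row i compares the mean of all time points after position i with the mean of all time points up to position i. The function name and the example means are hypothetical. With a flat course, as simulated here, every contrast estimate is expected to be zero apart from noise.

```python
from fractions import Fraction


def changepoint_contrast(k):
    """Row i: mean of time points i+1..k minus mean of time points 1..i
    (assumption: equal group sizes, so plain unweighted means)."""
    rows = []
    for i in range(1, k):
        before = [Fraction(-1, i)] * i
        after = [Fraction(1, k - i)] * (k - i)
        rows.append(before + after)
    return rows


def apply_contrasts(matrix, means):
    """Contrast estimates: each row of the matrix times the vector of means."""
    return [sum(c * m for c, m in zip(row, means)) for row in matrix]


# Hypothetical noise-free means: a flat course and a course with one step.
flat_estimates = apply_contrasts(changepoint_contrast(4), [3, 3, 3, 3])
step_estimates = apply_contrasts(changepoint_contrast(4), [2, 2, 5, 5])
```

In the step example the estimate is largest for the split that coincides with the step, while the neighbouring splits are also nonzero because they mix values from both sides of the change.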